Dependent Bigram Identification
Abstract
Dependent bigrams are pairs of consecutive words that occur together in a text more often than would be expected purely by chance. Identifying such bigrams is important because they provide valuable clues for machine translation, word sense disambiguation, and information retrieval. A variety of significance tests have been proposed to identify these interesting lexical pairs (e.g., Church et al., 1991; Dunning, 1993; Pedersen et al., 1996). In this poster I present a new statistic, minimum sensitivity, that is simple to compute and is free from the underlying distributional assumptions commonly made by significance tests.

The challenge in identifying dependent bigrams is that most are relatively rare regardless of the amount of text being considered. This follows from the distributional tendencies of individual bigrams as described by Zipf's Law. If the frequencies of the bigrams in a text are ordered from most to least frequent, $(f_1, f_2, \ldots, f_n)$, these frequencies roughly obey $f_i \propto 1/i$.

Consider the following example from a 1,300,000-word sample of the ACL/DCI Wall Street Journal Corpus. The contingency table below contains the frequency counts of oil and industry.

                 industry    not industry
    oil                17             240
    not oil         1,001       1,298,742

These counts show that oil industry occurs 17 times, oil occurs without industry 240 times, industry occurs without oil 1,001 times, and bigrams with neither oil as the first word nor industry as the second occur 1,298,742 times. This distribution is sparse and skewed and thus violates a central assumption implicit in significance testing of contingency tables (Read & Cressie, 1988).
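To make "more often than would be expected purely by chance" concrete, the following is a minimal sketch that takes the four counts quoted in the abstract and computes the count of oil industry expected if the two words were independent. The helper name `expected_counts` and the variable names `n11` to `n22` are illustrative assumptions, not notation from the poster, and the sketch does not implement the minimum sensitivity statistic, which the abstract does not define.

```python
# Counts quoted in the abstract for the word pair (oil, industry).
n11 = 17          # "oil" followed by "industry"
n12 = 240         # "oil" followed by some other word
n21 = 1_001       # "industry" preceded by some other word
n22 = 1_298_742   # bigrams with neither "oil" first nor "industry" second


def expected_counts(n11, n12, n21, n22):
    """Return the 2x2 table of counts expected under independence.

    Each expected cell is (row total * column total) / grand total.
    The function name is hypothetical, chosen for this illustration.
    """
    total = n11 + n12 + n21 + n22
    rows = (n11 + n12, n21 + n22)   # first word is / is not "oil"
    cols = (n11 + n21, n12 + n22)   # second word is / is not "industry"
    return [[rows[i] * cols[j] / total for j in range(2)] for i in range(2)]


total = n11 + n12 + n21 + n22
expected = expected_counts(n11, n12, n21, n22)

print(f"total bigrams:            {total:,}")
print(f"observed 'oil industry':  {n11}")
print(f"expected if independent:  {expected[0][0]:.3f}")
print(f"observed / expected:      {n11 / expected[0][0]:.1f}")
```

With these counts the expected frequency of oil industry under independence is about 0.2, far below the 17 observed occurrences. An expected cell count this small is one concrete sign of the sparse, skewed table the abstract describes.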
Similar resources
Language identification incorporating lexical information
In this paper we explore the use of lexical information for language identification (LID). Our reference LID system uses language-dependent acoustic phone models and phone-based bigram language models. For each language, lexical information is introduced by augmenting the phone vocabulary with the N most frequent words in the training data. Combined phone and word bigram models are used to prov...
Double bigram-decoding in phonotactic language identification
In this paper a phonotactic language identification system that employs a multilingual phone-recognizer with multiple language-dependent grammars to tokenize the spoken signal into several phone-streams is described. For each stream an independent set of language models is used to compute the language scores that are subsequently processed by two classification stages. Thus, the system acquires i...
A Machine Learning Approach for the Identification of Bengali Noun-Noun Compound Multiword Expressions
This paper presents a machine learning approach for identification of Bengali multiword expressions (MWE) which are bigram nominal compounds. Our proposed approach has two steps: (1) candidate extraction using chunk information and various heuristic rules and (2) training the machine learning algorithm called Random Forest to classify the candidates into two groups: bigram nominal compound MWE ...
Chinese Unknown Word Identification Based on Local Bigram Model with Integrally Smoothing Assumption
The paper presents a Chinese unknown word identification system based on a local bigram model. Generally, our word segmentation system employs a statistical-based unigram model. But to identify those unknown words, we take advantage of their contextual information and apply a bigram model locally. To explain this local approximation, we make an “integrally smoothing assumption”. As a simplifica...
Chinese Unknown Word Identification Based on Local Bigram Model
This paper presents a Chinese unknown word identification system based on a local bigram model. Generally, our word segmentation system employs a statistical-based unigram model. But to identify those unknown words, we take advantage of their contextual information and apply a bigram model locally. By adjusting the value of interpolation which is derived from a smoothing method, we combine thes...
Type and token bigram frequencies for two- through nine-letter words and the prediction of anagram difficulty.
Recent research on anagram solution has produced two original findings. First, it has shown that a new bigram frequency measure called top rank, which is based on a comparison of summed bigram frequencies, is an important predictor of anagram difficulty. Second, it has suggested that the measures from a type count are better than token measures at predicting anagram difficulty. Testing these hy...